Comparison of deep transfer learning algorithms and transferability measures for wearable sleep staging

Background Obtaining medical data using wearable sensors is a potential replacement for in-hospital monitoring, but the lack of data for such sensors poses a challenge for development. One solution is using in-hospital recordings to boost performance via transfer learning. While there are many possible transfer learning algorithms, few have been tested in the domain of EEG-based sleep staging. Furthermore, there are few ways for determining which transfer learning method will work best besides exhaustive testing. Measures of transferability do exist, but are typically used for selection of pre-trained models rather than algorithms and few have been tested on medical signals. We tested several supervised transfer learning algorithms on a sleep staging task using a single channel of EEG (AF7-Fpz) captured from an in-home commercial system. Results Two neural networks—one bespoke and another state-of-art open-source architecture—were pre-trained on one of six source datasets comprising 11,561 subjects undergoing clinical polysomnograms (PSGs), then re-trained on a target dataset of 75 full-night recordings from 24 subjects. Several transferability measures were then tested to determine which is most effective for assessing performance on unseen target data. Performance on the target dataset was improved using transfer learning, with re-training the head layers being the most effective in the majority of cases (up to 63.9% of cases). Transferability measures generally provided significant correlations with accuracy (up to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$r = -0.53$$\end{document}r=-0.53). Conclusion Re-training the head layers provided the largest performance boost. Transferability measures are useful indicators of transfer learning effectiveness.

manually inspect the signals in 30-s 'epochs' and determine whether the patient is awake, in rapid eye movement (REM) sleep, stage 1 sleep, stage 2 sleep, or stage 3 sleep according to the American Academy for Sleep Medicine rules [1]. Various statistics, such as the time spent in stage 3, time spent in REM or time spent waking up throughout the night are used as diagnostic tools. The procedure is expensive, however. A single polysomnogram can cost as much as $4000 [2] with the cost of sleep staging alone amounting to $800 [3]. Sleep stage scores can also be inconsistent, with technicians having an interrater reliability of 83% [4].
PSG is also unsuitable for long-term monitoring, as it typically requires special equipment and training to set up the equipment, and so patients cannot simply take recordings on themselves. Long-term monitoring of sleep disorder treatment is instead conducted using surveys of subjective sleep quality or having the patient note the time at which they went to bed or woke up each night, however, these methods are highly inaccurate compared to PSG [5] and are unable to measure certain metrics such as REM onset latency or total time spent in deep sleep. Sleep can be measured more accurately and with more detail using actigraphy [5], however this is still less accurate than PSG [5].

Wearable medical devices
Wearable medical devices such as the Actiwatch (Philips, Amsterdam, Netherlands), Apple Watch (Apple, Cupertino, CA, USA), Sleep Profiler (Advanced Brain Monitoring, Carlsbad, CA, USA) and Dreem Headband (Dreem, Paris, France) have seen a surge in interest for their use as an alternative to in-hospital monitoring for a number of tasks, including PSG. Wearables can be used continuously in an at-home setting for lengthy periods of time by an untrained user, thereby facilitating long-term monitoring and eliminating the need for patients to be brought into a hospital.
Processing the comparatively large amount of data has driven the push for automated processing of medical data using machine learning [6][7][8][9][10][11][12]. However, the small amount of data available from wearable devices is an obstacle, as machine learning algorithms are immensely data-hungry. Furthermore, machine learning models trained on in-hospital recordings will not achieve good performance on wearable recordings due to a multitude of differences between in-hospital and wearable recordings, including signal quality, sensor location, available modalities, and differing pathologies between patients receiving a diagnostic polysomnogram vs those requiring long-term monitoring.

Transfer learning
Transfer learning is the process of boosting machine learning performance on one domain (referred to as the 'target' domain with sample set X t ) by pre-training the model on another, similar domain (referred to as the 'source' domain with sample set X s ), thereby compensating for potentially insufficient target data by allowing the model to apply knowledge it gained from the source domain. Transfer learning poses a potential solution to the lack of data available from wearable devices. A large amount of data are available from recordings taken in a hospital using conventional medical sensors, which could potentially be used to boost machine learning performance on wearable devices.

Limitations in current transfer learning research
There are several areas of transfer learning research which are underexplored for sleep staging. Much of supervised transfer learning is done using very simple methods such as re-training a few layers of the model (hereafter referred to as head re-training) or retraining the entire model at a smaller learning rate [13][14][15]. However, there are other more sophisticated transfer learning methods such as Correlation Alignment (CORAL) [16], Deep Domain Confusion (DDC) [17], and Subspace Alignment (SA) [18] which are rarely tested on sleep staging tasks. Alternatives to fully supervised domain adaptation include semi/unsupervised transfer learning [19,20], meta-learning [21], pre-training on a related but separate task [22], and transfer learning onto individual subjects [19,23]. Such approaches, however, have shortcomings such as not utilizing labeled data from the target domain, the need for additional labeled data that are rarely collected in clinical tasks, or the need for models to become specialized for a single subject. Therefore, the automated sleep staging field would generally benefit from a greater understanding of which fully supervised transfer learning methods would work best and when. More research comparing transfer learning techniques head-to-head is required.
Similarly, various design decisions must be made when re-training models in addition to the choice of transfer learning algorithm, such as which architecture to use, which layers to re-train and which datasets to pre-train on. There is some research on measures of transferability between datasets which are potentially useful for determining which of several pre-trained models or datasets to use in transfer learning, but again, these are rarely tested and there is little research comparing methods head-to-head. There is also little research on using transferability measures for deciding which of several transfer learning algorithms to use.
Lastly, most transfer learning research focuses on computer vision or natural language processing tasks-comparatively little research focuses on transfer learning for medical tasks, which poses a problem for medical machine learning researchers as they may erroneously use techniques which work well for computer vision or natural language processing but not on medical signals.
We tested several popular transfer learning algorithms in a supervised setting. Several publicly available in-hospital PSG datasets were used as source datasets and the target dataset was 75 recordings taken on 24 healthy adult volunteers using a wearable EEG sensor. Several transferability measures were also tested to determine which was most strongly correlated with accuracy on unseen data.

Head re-training
One of the simplest and more widely used transfer techniques is simply freezing every layer except for the few closest layers to the output and re-training the unfrozen layers on the target. This method was used as a baseline. After pre-training the model on the source, every layer except for a single dense layer adjacent to the output was frozen and the model was re-trained on the target. t[1:r] is the matrix square root of the Moore-Penrose pseudoinverse of the matrix of r largest singular values of C t and r is the rank of either C s or C t , whichever is smaller. Once A is found, it is used to transform the source data so that they more closely resemble the target data, after which training on the transformed source data proceeds as normal.
Although CORAL was originally designed to be unsupervised, it is easy to modify to be supervised by training the model on both the transformed source and untransformed target data. For this work, we also used a modified version of CORAL of our own design which takes class into account when learning the transformations. The modified CORAL will be referred to as Per-Class CORAL. Per-Class CORAL computes a different transformation A i for each class by aligning the covariances of source samples in class i with the covariances of target samples which are also in class i. Each class in the source dataset is then transformed individually.
CORAL can be applied to deep learning by performing the described transformations on learned features φ(x s ) and φ(x t ) obtained using the output from some layer of the base model, then re-training the succeeding layers of the model on the transformed features. For this work, we froze every layer of the base model from the input up to and including the convolutional layer closest to the output, then performed the CORAL algorithm on the activations from the convolutional layer closest to the output, then re-trained the unfrozen layers on the target and transformed source activations.

Deep domain confusion
Instead of transforming learned features, DDC works by training models in which the learned features differ little between the source on target to begin with. Training invariant features is done by adding an additional loss function equal to the maximum mean discrepancy [24] (MMD) between source and target samples within each batch. MMD is a measure of the difference between probability distributions which finds the distance between the average kernel of the kernel embeddings k(x s ) and k(x t ) for source dataset X s and target dataset X s : where H k is a reproducing kernel Hilbert space with characteristic kernel k. Using the property that < x, y > H k = k(x, y) allows 2 to be further simplified: Note that unlike other measures of distance between probability distributions such as KL-divergence or Wasserstein distance, MMD can be computed directly from samples, and does not require an estimate of the probability density. Time consuming and potentially inaccurate computations of probability density such as through kernel density estimation are thus unnecessary. 4 is computed with quadratic time complexity, but if the samples x s and x t are independent and identically distributed and |X s | = |X t | = n , an unbiased estimate can be used which can be computed in linear time [24]: The efficiency at which MMD can be calculated makes it useful in machine learning algorithms in which it may need to be computed repeatedly.
In DDC, the MMD of activations at one or several layers is calculated between source and target samples in each batch and used as a loss function in addition to the standard cross-entropy loss. In doing so, the neural network is incentivized to learn features which are very similar between source and target, and so the model can be simultaneously trained on both source and target datasets without losing accuracy on the target. To maintain consistency with the other transfer learning methods we implemented, we chose the final convolutional layer as the layer at which to calculate the MMD loss.

Subspace alignment
SA works by projecting both the source and target data onto two lower-dimensional linear subspaces, then using a linear transformation M to align the source subspace with the target subspace. The lower-dimensional subspaces for both the source and target are  where V s and V t are matrices of the basis vectors for the source and target subspaces.
A new source sample x s can then transformed into the target subspace using x T s V s M . In the fully or semi-supervised setting, the source and target samples are transformed into the target subspace, after which a model can be trained on both. As with CORAL, we froze each layer in the base model from the input to and including the convolutional layer closest to the output, then performed SA on the activations from the convolutional layer closest to the output, then re-trained the unfrozen layers on the transformed activations. We choose the dimensionality of the transformed features to be 100, which we determined via grid search on a subset of 7 target subjects.

Log expected empirical prediction
Log expected empirical prediction (LEEP) is a measure of transferability which creates a simple Bayesian classifier which attempts to classify target samples based on the outputs of a model trained on the source dataset: where z are the label outputs of the pre-trained model, θ(x t,i ) z is the model's estimated probability that sample x t,i has label z, and i:y i =y indicates all the samples whose true label is y. LEEP T (θ) for a model θ is equal to the average log-likelihood of the Bayesian classifier on the target domain: LEEP can be considered a measure of how well the Bayesian classifier performs on the target dataset, which by extension indicates how well the model will perform on the target dataset after a small amount of re-training, since the Bayesian model performs classification using the pre-trained model's outputs.

H-score
Bao et al. [25] show that the following is minimized when a given feature extractor φ is optimal: H(φ) can thus be used as indicator of how well suited a classifier with learned feature extractor φ is for a particular dataset. Note that the exact value of H(φ) will depend on the dataset, and thus that a higher H-score on one target dataset does not mean a model re-trained on that dataset will necessarily perform better than a model trained on another dataset with a lower H-score. Bao et al. [25] also derives a value they dub transferability which can be compared across different target datasets by normalizing H-score by its theoretical minimum possible value, however it is not necessary to do so to determine which of several possible models will perform best on a fixed dataset, and the calculation of transferability requires a more time-consuming iterative procedure. We therefore performed all testing using H-score and not transferability in order to reduce computation time.

Hypothesis margin
One practical advantage of LEEP and H-score is that they only require the model and the target dataset to compute-it is not necessary to have any data from the source dataset. However, this is also a disadvantage from a theoretical standpoint as it is more difficult to interpret what characteristics of the source dataset make it effective for pre-training. We thus propose using several measures of statistical characteristics of datasets which have more concrete interpretations.
Hypothesis margin is a measure of the margins between sets of points with differing labels, which has been used in feature selection [26][27][28][29] and in loss functions for machine learning [30]. The hypothesis margin M(x) for a single point x is: where nearmiss(x) is the nearest point to x which is in a different class from x and nearhit(x) is the nearest point to x which is in the same class. We used the average hypothesis margin M (X t , X s ) to study the margin between learned features extracted from points in the source dataset and learned features extracted from points in the target dataset using the feature extractor φ from a model pre-trained on the source dataset: In this work, we downsample X t and X s by a factor of 10 to reduce the computational resources used in computing the distances between each pair of samples. We hypothesized that when there is less of a margin between the learned features from the source and target dataset, performance on the target dataset will be better because it requires less adjustment for the model to achieve good performance.

Silhouette score
Similarly, we hypothesized that when there is a greater degree of overlap between learned features of the source and target dataset, transfer learning performance will be better due to the smaller amount of adjustment necessary to make to the model. We measured degree of overlap using silhouette score: where A(x) is the average L2 distance from point x to every other point in the same class and B(x) is the average L2 distance from x to every point in a different class. Silhouette score is a measure of the degree of overlap between sets of points of differing classes or clusters, and is most often used in evaluating the quality of clustering algorithms [31,32].
Silhouette score between the learned features from the source and target S(φ(X t ), φ(X s )) is: In this case, A(φ(x)) is then the average L2 distance from φ(x) to every point in the same dataset and B(φ(x)) is the average L2 distance from φ(x) to every point in the other dataset.

Target density around source
Target density around source [18] (TDAS) is a measure of the local density of target samples within some neighborhood of the source samples, and is mainly intended for use in nearest-neighbor models [18]. Let sim(x s , x t ) = (x s V s M)(x t V t ) T for x s and x t are a source and target sample, V s , V t are the subspace bases found through PCA on the source and target, and M is a transformation for aligning the source and target basis vectors as explained in Section 1.5.4. sim(x s , x t ) can be considered a measure of similarity between a source and target sample following alignment of their lower-dimensional projections. To measure the transferability between two datasets, TDAS is defined as the average number of target samples that have similarity of at least ǫ to a given target sample: We chose ǫ to be the median Euclidean distance between samples in the target dataset m multiplied by either .1, 1 or 10. The reason why we use ǫ at 3 different values is to evaluate the sensitivity of TDAS to ǫ . The reason we chose to tie ǫ to the median value instead of using some fixed value was so that the effectiveness of TDAS would be more stable across target datasets of differing pairwise distances between samples.

Maximum mean discrepancy
As explained in Section 1.5.3, MMD is a measure of the similarity between probability distributions which is used in machine learning, both for training algorithms [17,33] and in making design decisions for transfer learning [17]. We use the radial basis function for the kernel with width parameter γ parameter equal to .1Ŵ , 1 Ŵ or 10Ŵ , where Ŵ = −1/(2 * M) . Using the value γ = 1Ŵ is the median heuristic [24,34] for calculating (14)  γ , but as with TDAS we computed MMD using the parameter value increased or decreased in order to observe the sensitivity of the MMD measure to γ.

Results
Several publicly available PSG datasets using standard 10-20 scalp montages were used as source domains: Sleep Heart Health Study dataset (SHHS) [35,36], the Computing in Cardiology Challenge 2018 dataset (CiCC) [37,38], the Institute of Systems and Robotics, University of Coimbra dataset (ISRUC) [39], the Osteoporotic Fractures in Men Study dataset (MrOS) [35,40], The Montreal Archive of Sleep Studies (MASS) [41], and the Wisconsin Sleep Cohort (WSC) [35,42]. The target dataset consisted of 75 recordings from 24 subjects we obtained using an X4 Sleep Profiler (Advanced Brain Monitoring, Carlsbad, CA)-a commercially available wearable EEG sensor. Two architectures were used: a novel and relatively simple bespoke (13-layer convolutional neural network (CNN)) model that we designed for use on resource-constrained body-worn systems (Fig 1); and a more computationally intensive contemporary open-source algorithm called DeepSleepNet [43] (Fig 2). This algorithm was selected due to its state-ofthe-art performance on sleep staging tasks, its open architecture, and for its frequent use as a basis of comparison by other researchers [7,22,[44][45][46][47]. DeepSleepNet contains 35 layers, including both convolutional and long short-term memory (LSTM) layers [43]. Table 1 lists the leave-one-subject-out cross-validation accuracies attained on the target dataset when using each source dataset for pre-training and each transfer learning algorithm for re-training. Table 2 lists the leave-one-subject-out Cohen's κ values. Table 3 lists the fraction of instances in which a particular learning algorithm outperformed other techniques. Every transfer learning algorithm had better than or equivalent  performance to the baseline method of head re-training except for SA, which did noticeably worse. CORAL and Per-Class CORAL were not significantly better than head retraining ( p > 0.05 , n = 144 ). DDC (indicated in bold) was significantly better than head re-training ( p < 0.05 , n = 144 ). DDC also performed better than any other method a majority (52.1%) of the time. As an additional baseline, we also trained the model from random initialization on the target dataset without pre-training and achieved an accuracy of 77.0%, which is slightly better than head re-training but slightly below the highest achieving method of DDC. We also tested re-training the entire model instead of just the head, which achieved a similar performance of 77.3%. Table 4 lists the Spearman's correlations of each transferability measure with the accuracy of a bespoke CNN model re-trained using each transfer learning method. Every transferability measure except hypothesis margin achieved significant ( p < 0.05 ) correlations with accuracy when using at least one transfer learning algorithm. This continued to be true even after a Bonferroni adjustment ( p < 0.0017 ). H-score achieved the highest overall correlation ( r = 0.23 ). Measures were generally more strongly correlated  Table 4 Correlations of each transferability measure with CNN accuracy for individual algorithms as well as overall with accuracies within individual algorithms than across all algorithms. The correlation strengths of MMD and TDAS also varied by the value of their respective parameters.

Effect of re-training layers on performance
To test the sensitivity of each algorithm to the layer on which the domain transformation is applied, we re-tested the bespoke CNN when re-training additional layers (Tables 5  and 6). See Section 5.5 for training details.
As with Section 2.0.1, Table 7 lists the fraction of instances in which a particular learning algorithm outperformed other techniques. All algorithms performed better except DDC, which performed worse. With the exception of DDC, the algorithms' performances relative to each other are similar to when re-training a smaller number of layers (i.e., Head Re-train, CORAL, and Per-Class CORAL performed similarly while SA performed the worst). Head Re-train outperformed all other transfer learning algorithms on the largest number of cases, but did not have a significantly higher average accuracy than CORAL ( p > 0.05 ). Head Re-train, CORAL, and Per-Class CORAL also now outperformed the baseline of training from random initialization without transfer learning. Table 8 lists the correlations between each measure of transferability and accuracy   achieved when re-training a larger number of layers. Correlations are generally lower, but TDAS and H-score continue to be the most strongly correlated with accuracy and achieve statistical significance in half of all cases, even after applying Bonferroni correction ( p < 0.001).

Testing on differing network architectures
We also tested each transfer learning algorithm and transferability measure on the open-source sleep staging model DeepSleepNet [43] to determine how performance is affected by changes in the network architecture (Tables 9 and 10). Note that DDC was not tested on DeepSleepNet because the derivation of DDC assumes that each sample is presented in a random order during training [17,24], an assumption which is violated in recurrent models. As with the bespoke CNN, SA exhibited the lowest performance and CORAL, Per-Class CORAL and Head Re-train all exhibited similar (higher) performance (Table 11), but with Head Re-train now being statistically significantly better than CORAL ( p < 0.01 ). However, no transfer learning method was able to out-perform the baseline of training from random initialization without transfer learning, which achieved an accuracy of 74.5%.  Correlations between accuracy and each transferability measure were highest for TDAS and MMD (Table 12). Unlike with the bespoke model, H-score's correlation with accuracy no longer achieves Bonferroni-adjusted significance for any algorithms except for SA, but is still significant in the overall case.

Discussion
Our findings are twofold: (1) we evaluated the performance of several transfer learning algorithms head-to-head on a sleep staging task and (2) we evaluated how well several measures of transferability work for assessing the accuracy achievable when using a particular source dataset and transfer learning algorithm. Transfer learning techniques were tested and compared on a sleep staging task in which a model pre-trained on clinical data was re-trained on data from a wearable device using one of two possible models, two possible sets of layers to re-train, and five different algorithms. Out of all source datasets, architectures and algorithms tested, the highest accuracy and Cohen's κ achieved was 80.0% and .718, respectively, which were obtained using CORAL on the bespoke model when re-training more (four) layers.
With the exception of DDC, the relative performance of each transfer learning algorithm was consistent across different conditions. CORAL and Per-Class CORAL both performed similarly to the baseline performance. SA was the poorest performing in all cases. We speculate that the reduced effectiveness of SA occurs because SA involves projecting the learned features onto a lower-dimensional linear subspace which risks the loss of critical information. DDC usually obtained the highest accuracy when applied to a layer close to the output, but not when applied to a layer further from the output, suggesting that DDC can be effective but is highly sensitive to the layer to which it is applied, and so tuning may be necessary. The higher performance when the loss is applied to layers closer to the output makes sense given the principle behind DDC. That is, DDC works by incentivizing the model to learn a similar hidden-layer representation across both source and target datasets, so if the layer it is applied to already generalizes well between datasets (and layers closer to the input have indeed been found to learn simpler, more generalizable features [48][49][50]), it may be of limited benefit. In contrast, head re-training resulted in the best or second best performance in all training conditions, suggesting it is the most robust choice for obtaining a good (even if not necessarily the best) performance when time and resources available for hyperparameter tuning are limited.
Unlike with the bespoke CNN, All transfer learning methods reduced performance when using DeepSleepNet. This could be attributable to the greater depth of DeepSleep-Net relative to the bespoke CNN, so re-training and domain adaptation at the output layer was inadequate for compensating for how much information from the source dataset had been encoded into the network.
The correlations between accuracy and several measures of transferability were also assessed to determine whether these measures were potentially useful for transfer learning design. Most measures attained significant correlations on at least some transfer learning algorithms, which is consistent with previous research [14,17,18,25]. The search for effective measures of transferability is relevant to machine learning engineers seeking to reduce development time. Using transferability measures to guess which transfer learning methods will work best could avoid the need for exhaustive testing.
When performing transfer learning on the bespoke model with a smaller number of re-trainable layers, the transferability measure most correlated with accuracy overall Page 15 of 25 Waters and Clifford BioMedical Engineering OnLine (2022) 21:66 was the H-score, and the second most correlated was TDAS. On the other two testing conditions however (i.e., transfer learning on DeepSleepNet and transfer learning on the bespoke model with a larger number of re-trainable layers), TDAS outperformed H-score, especially when using DeepSleepNet. Therefore, although the H-score may perform well in some cases, its performance is inconsistent and so TDAS may be the more robust option. TDAS did exhibit some sensitivity to choice of parameter ǫ , however Fernando et al. 's. recommendation of setting ǫ to the median Euclidean distance between target samples [18] outperformed other choices for ǫ in most cases and was the second best choice in the remaining cases, and so can be considered a good heuristic. Despite the high overall correlations of TDAS and H-score, other measures still achieved higher correlations with accuracy for specific algorithms, and so it may be more advisable to make design decisions (such as on which source dataset to pre-train) using the transferability measure most suitable for a particular learning algorithm.
It is important to note that no single transfer learning method performs best in all cases. DDC, for example showed the strongest performance overall when re-training with a smaller number of layers, but still performed worse than the baseline when retraining with a larger number of layers.
Similarly, no transferability measure had significant correlations with the performance of all transfer learning algorithms. H-score, for example was significantly correlated with performance on every transfer learning algorithm on the bespoke architecture when re-training a smaller number of layers except DDC. The poor correlation with DDC is likely because H-score assumes a fixed feature extractor [25], whereas DDC involves fine-tuning the feature extraction layers. We can thus conclude that H-score would be effective as an indicator when using CORAL, but one should use some other transferability measure such as MMD or TDAS when using DDC. Furthermore, H-score achieved significant correlations for very few algorithms when applied to DeepSleepNet, which again, may result from the state-dependence of the LSTM layers violating H-score's assumptions.
Since no transfer learning algorithm or transferability measure performed best in all cases, the results here cannot be taken as a replacement for exhaustive testing. The only guaranteed way to determine which of several methods will work best is to test them. However, when there is limited time and resources available, the results presented in this work can be used to narrow down the list of possible methods to a few with a higher probability of success. When there is insufficient time to experiment with multiple possible transfer learning algorithms, our results indicate that head re-training can lead to the most robust results (i.e., it consistently performs comparably to, or better than other approaches). If multiple pre-trained models are available, our results favor choosing the model which attains the highest TDAS score on the target dataset. When there is limited time for testing, but there are a large number of possible transfer learning algorithms and/or pre-trained models to choose from, we recommend calculating the TDAS value of all possible combinations of algorithms and pre-trained models, selecting the highestscoring combinations, and directly testing them to determine which achieves the highest performance.
More research is needed to determine why some algorithms work better in some cases but not others, but we speculate that performance varies according to whether the fundamental assumptions of the algorithms are met. CORAL works well when the target distribution can be approximated by a linear transformation of the source distribution. DDC works well when the domain shift between the source and target is amplified along the layers of the neural network, possibly due to overfitting to the source dataset at layers closer to the output, in which case the model benefits from an additional loss function which punishes differences in the activations. Subspace Alignment works well when the relevant features lie on a lower-dimensional manifold, in which case projection onto this manifold does not cause significant loss of information.
When the assumptions of the more sophisticated transfer learning methods are not met, the additional constraints, operations and loss functions employed by such algorithms can instead cause loss of information or steer the model away from an optimal solution, in which case the simplest method of head re-training provides the highest performance.
The same is also true for the transferability measures. Performance depends on whether the assumptions of the measures are met. The H-score works well when the network layers closer to the input are fixed, but this assumption is violated by transfer learning methods which involve re-training such layers (such as DDC) or when those layers have state-dependence (such as when using an LSTM layer).
The randomization of samples in the MMD approach may reduce its reliability as a transferability measure through the introduction of stochasticity, but MMD can still out-perform deterministic measures such as TDAS in some cases despite TDAS working better in general. TDAS tallies the number of target samples in close proximity to source samples, but target samples far away from any source samples have little effect on the TDAS score. MMD on the other hand takes all samples into account. As a result, MMD may work better as a measure of transferability when there are many target samples which are distant from source samples or the data exhibit extreme outliers, even if TDAS works better in general. MMD may also out-perform TDAS when training via DDC, as DDC explicitly minimizes MMD between source and target, and so a large MMD may indicate that performance improvements can be made by an algorithm designed to reduce the value of MMD. This is the first work we know of to test transferability measures across multiple transfer learning algorithms, and is also the first work we know of to evaluate transferability measures on a sleep staging task. Our findings also add to the body of research on the use of automation for wearable medical sensors, particularly regarding the use of transfer learning to boost performance [7,[51][52][53]. In particular, our work adds to the body of work on fully supervised transfer learning without the need for tuning models to specific patients.
One limitation was that the test subjects were healthy adults of similar age, and so more testing is necessary to determine the effectiveness of the learned models against older adults and people with sleep disorders. The findings here should also not be considered a conclusive evaluation of which transfer learning algorithms or measures necessarily work best in all instances, as we used a single target domain with similar source domains in a supervised setting. Many of the transfer learning algorithms and transferability measures were developed for computer vision tasks, for an unsupervised/semisupervised setting, or when using engineered instead of learned features, and so more research is necessary to determine whether the results found here are true of other tasks, other source/target combinations, other sensors or other settings. However, our findings do highlight that the source data and target data influence both the type of transfer learning approach and the measure for identifying the best approach.

Conclusion
It was experimentally found that the most widely used transfer learning method (retraining the head layers) was the most robust approach, as it was either the best or second best in all three experimental conditions. DDC however was able to out-perform head re-training in one case, but showed considerable sensitivity to the choice of layers to which it is applied. H-score was correlated best with accuracy in cases where the assumption of a fixed feature extractor is met, but this assumption is violated for transfer learning methods which involve re-training all layers and for architectures with state-dependence. TDAS can be a strong correlate with accuracy across cases, but shows some sensitivity to the choice in the ǫ parameter. Future research directions could include training different layers with different learning rates to see how this impacts each algorithm and measure, investigating the characteristics of the data that make some algorithms more effective than others, and testing in semi-or fully unsupervised settings.

Datasets
CiCC contains 994 healthy subjects or patients experiencing spontaneous arousals, respiratory effort related arousals, bruxism, hypoventilation, hypopneas, apneas, vocalizations, snores, periodic leg movements, Cheyne-Stokes breathing or partial airway obstructions. SHHS contains 6441 healthy subjects or patients with atherosclerosis, airway obstructive diseases or other cardiovascular problem. ISRUC contains 126 healthy subjects and patients with various disorders including REM sleep behavior disorder, obstructive sleep apnea, snoring, periodic limb movement, epilepsy, depression, Parkinson's or insomnia. MrOS contains 2900 healthy subjects and patients with sleep disordered breathing or nocturnal hypoxemia. WSC contains over 1100 healthy subjects and patients with sleep disordered breathing. The MASS dataset contains 200 subjects with apnea/hypopnea indices of up to 20 but were otherwise healthy except for 15 with mild cognitive impairment and 7 with restless leg syndrome. In order to avoid biasing the model towards subjects who had done more recordings, only the first recording from every subject was used when training on datasets in which some subjects had multiple recordings. In all datasets, a single EEG channel was used. Either the C3-M2 or C3-A2 EEG channel was used (depending on which was available) in the source datasets in order to keep the signal content as similar as possible to the channels available in the Page 18 of 25 Waters and Clifford BioMedical Engineering OnLine (2022) 21:66 target dataset, which were across the forehead. The target dataset contains 24 subjects with 1-6 recordings each (75 recordings total) using an X4 Sleep Profiler (Advanced Brain Monitoring, Carlsbad, CA). The X4 consists of a headband and several sensors across the forehead. A single channel (AF7-Fpz) was used. Ground truth sleep stage labels were manually determined by a human sleep staging technician. Subjects were volunteers recruited from Georgia Tech's graduate student body and from Emory's Biomedical Informatics department. As with the source dataset, no more than two recordings per subject were included in the training set in order to avoid biasing the model towards subjects who volunteered for more recordings. However, no data were excluded during evaluation.

Training and testing procedure
To evaluate the effectiveness of each transfer learning algorithm, six base models using each of the two architectures (12 models total) were pre-trained on one of the six source datasets before being re-trained on the target dataset using one of the transfer learning algorithms described above. The pre-training procedure for the open-source model was done using the same methods and parameters described in [43]. The bespoke model was pre-trained using ADAM with an initial learning rate of 0.001 either for 1000 epochs or until the early stopping criteria (running 30 epochs without obtaining more than a .1% improvement in accuracy on a validation set) were met, whichever came first. Dropout at a rate of .5 was applied to the fully connected layer and all layers used an L2 regularization weight 10 −6.9 (values found using Bayesian hyperparameter tuning). Transfer learning hyperparameters for both architectures were the same as the hyperparameters used in pre-training the bespoke model. For source datasets where some subjects had multiple recordings, only the first recording from each subject was used in order to avoid biasing the model towards the subjects with multiple recordings. Some of the transfer learning algorithms required training on source and target simultaneously; in these cases, a subset of source recordings equal to the number of target recordings were selected so as to maintain a balanced number of source and target samples. The subset of source recordings was selected by choosing every nth recording in the order in which they were numbered to ensure the subset was representative of the full dataset. The average accuracy and Cohen's κ were obtained using leave-one-subject-out cross-validation. The number of times a particular transfer learning algorithm outperformed every other learning algorithm for a particular validation subject was also tallied. A paired t-test was used to determine whether each transfer learning algorithm significantly outperformed the baseline method of head re-training. Transferability measures were evaluated only on the target subjects used in trainingmeasures were not evaluated on the subject being left out for validation. The relationship between each transferability measure and the performance of the re-trained model was evaluated using Spearman's rank correlation between the transferability measure and the accuracy of the pre-trained model on the validation set. A high correlation with accuracy suggests the transferability measure is a reasonable indicator of how well a particular model will perform on the dataset. The reason Spearman's correlation was used instead of Pearson's correlation is because the transferability measures do not necessarily increase linearly with the accuracy of the trained model. The overall correlation between accuracy and each transferability measure is found along with the correlation for each individual transfer learning algorithm.

Inclusion of women and minorities
In compliance with Sex and Gender Equity in Research (SAGER) guidelines, we report the demographic makeup of all datasets. MrOS is the only dataset which is entirely male. ISRUC is 40% female (47 female patients total), WSC is 46% female (515 patients total), CiCC is 34% female (227 female patients total), and SHHS is 52% female (3039 patients total). The SHHS is 86% White, 9% Black, and 7% Other. MrOS is 93% White, 4% Black, 3% Asian, 0.1% Native American, Native Hawaiian or Native Pacific Islander, 1% Multiracial and 2% Unknown. The WSC is 94% White, 2% Black, 1% Asian, 1% Hispanic and 1% Native American. CiCC and ISRUC did not report the racial distribution, but are obtained from hospital PSG records without regard for race, and are thus expected to reflect the racial distribution of patients referred to sleep labs.

Base models
Two base models were used-one bespoke model of our own design (Fig 1) and an open-source architecture, DeepSleepNet [43] (Fig 2). DeepSleepNet was selected due to its state-of-the-art performance on sleep staging tasks, for its frequent use as a basis of comparison by other papers [22,[44][45][46], and for having a much larger, and differing architecture from our bespoke model. Furthermore, DeepSleepNet uses a combination of both convolutional and recurrent layers. The convolutional layers extract features from one epoch while the recurrent layers take temporal information into account (i.e., the score of one epoch is used for determining the score of the subsequent epoch). Most state-of-the-art architectures employ a similar paradigm of combining feature-extracting convolutional layers with recurrent layers or transformer mechanisms [45][46][47][54][55][56][57][58][59][60], and so DeepSleepNet is representative of state-of-the-art methods. For the bespoke model, 4 convolutional layers were used, with each convolutional layer being followed by a ReLU, Max Pooling and Batch Normalization layer. The input was the short-time Fourier transform of the EEG. The head of the model was a single dense layer. Dropout was  used on the dense layer. Training was done using the Adam optimizer [61]. The dropout fraction, L2 regularization and number of filters in each layer were found using Bayesian hyperparamter tuning on the SHHS dataset. Training and testing on the clinical datasets showed the base model to be capable of performance on par with that of human technicians (Table 13). DeepSleepNet consists of two branches containing a series of convolutional, batch normalization, ReLU, max pooling and dropout layers which then merge before being fed into two more separate branches, one containing a fully connected layer and the other containing two consecutive bi-directional LSTM layers which are each followed by a dropout layer. The two branches are then merged and fed into a softmax layer. The input is raw EEG. DeepSleepNet uses two phases of training. In the first phase, the convolutional layers without the LSTM layers are trained for 100 epochs on a version of the dataset which was class-balanced by randomly duplicating samples from minority samples. In the second phase, the LSTM and final softmax layers were added and the model trained again for 200 more epochs on the original imbalanced dataset in batches of 25 consecutive epochs. In the original paper, the performance of DeepSleepNet is evaluated on the test set every epoch and the model weights from the epoch which achieved the highest accuracy on the test set are used to test the model again on the same subject, with the only difference being that the model states are re-set in between each 25-epoch batch during training. DeepSleepNet was trained on each source dataset using the most of the same code and parameters used in the original DeepSleepNet paper, but transfer learning was performed using the most of the same code and parameters as our bespoke model. The subjects used for early stopping during re-training are separate from the subjects used for testing. During pre-training, 1% of subjects are separated from the rest of the source subjects and used to decide at what training epoch to load the highestperforming weights from.

Re-training larger numbers of layers
To test how the choice of layers to re-train effects the performance of each algorithm and transferability measure, each algorithm was applied to the convolutional layer closest to the output in addition to just the dense layer. For the head re-training algorithm, the dense layer, the nearby convolutional layer and each of their respective batch normalization layers were re-trained while the other layers were frozen. For CORAL, Per-Class CORAL, and SA, the same layers were frozen and re-trained as with Head Retrain, but an additional domain adaptation was applied to the output of the frozen layers. For DDC, all layers were re-trained, but the MMD loss was applied to input to the max pooling layer second closest to the output layer.
Note that only one set of layers was tested for DeepSleepNet. Because DeepSleep-Net makes use of multiple branches which split and re-join, there is no location within the architecture to apply the domain adaptation/loss function which can be compared to a similar location within the more linear bespoke CNN except for the output layer.